
    Inference of demographic history from genealogical trees using reversible jump Markov chain Monte Carlo

    Background: Coalescent theory is a general framework to model genetic variation in a population. Specifically, it allows inference about population parameters from sampled DNA sequences. However, most currently employed variants of coalescent theory consider only very simple demographic scenarios of population size change, such as exponential growth. Results: Here we develop a coalescent approach that allows Bayesian non-parametric estimation of the demographic history using genealogies reconstructed from sampled DNA sequences. In this framework, inference and model selection are done using reversible jump Markov chain Monte Carlo (MCMC). This method is computationally efficient and overcomes the limitations of related non-parametric approaches such as the skyline plot. We validate the approach using simulated data. Subsequently, we reanalyze HIV-1 sequence data from Central Africa and Hepatitis C virus (HCV) data from Egypt. Conclusions: The new method provides a Bayesian procedure for non-parametric estimation of the demographic history. By construction it additionally provides confidence limits and may be used jointly with other MCMC-based coalescent approaches.

    Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process

    Background: Causal networks based on the vector autoregressive (VAR) process are a promising statistical tool for modeling regulatory interactions in a cell. However, learning these networks is challenging due to the low sample size and high dimensionality of genomic data. Results: We present a novel and highly efficient approach to estimate a VAR network. It proceeds in two steps: (i) improved estimation of the VAR regression coefficients using an analytic shrinkage approach, and (ii) subsequent model selection by testing the associated partial correlations. In simulations with small sample sizes, this approach outperformed all other considered approaches in terms of true discovery rate (number of correctly identified edges relative to the significant edges). Moreover, the analysis of expression time series data from Arabidopsis thaliana resulted in a biologically sensible network. Conclusion: Statistical learning of large-scale VAR causal models can be done efficiently by the proposed procedure, even in the difficult data situations prevalent in genomics and proteomics. Availability: The method is implemented in R code that is available from the authors on request.
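
Step (i) of the abstract above can be sketched in a few lines. This is an illustrative sketch only, not the authors' R implementation: plain ridge shrinkage stands in for their analytic shrinkage estimator, and the VAR(1) model, penalty value, and simulated coefficient matrix are assumptions made for the demonstration.

```python
import numpy as np

def shrinkage_var(X, lam=0.1):
    """Estimate VAR(1) coefficients A in  x[t+1] = A @ x[t] + noise.

    `lam` is a ridge-style shrinkage penalty standing in for the paper's
    analytic shrinkage estimator; it stabilises the regression when the
    number of genes exceeds the number of time points.
    """
    past, future = X[:-1], X[1:]   # lagged and current observations
    p = X.shape[1]
    # Ridge-regularised least squares: future ~= past @ A.T
    A = np.linalg.solve(past.T @ past + lam * np.eye(p), past.T @ future).T
    return A

# Tiny demonstration on data simulated from a known coefficient matrix
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.2],
                   [0.0, 0.3]])
X = np.zeros((2000, 2))
for t in range(1999):
    X[t + 1] = A_true @ X[t] + 0.1 * rng.standard_normal(2)
A_hat = shrinkage_var(X, lam=0.01)
```

In the paper's step (ii), edges are then selected by testing the partial correlations associated with these coefficients; in a sketch like this, thresholding the entries of `A_hat` serves the same screening purpose.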

    Gene network reconstruction from microarray data

    Background: Often, software available for biological pathway reconstruction relies on literature searches to find links between genes. The aim of this study is to reconstruct gene networks from microarray data, using Graphical Gaussian models. Results: The GeneNet R package was applied to the Eadgene chicken infection data set. No significant edges were found for the list of differentially expressed genes between conditions MM8 and MA8. On the other hand, a large number of significant edges were found among 85 differentially expressed genes between conditions MM8 and MM24. Conclusion: Many edges were inferred from the microarray data. Most of them could, however, not be validated using other pathway reconstruction software. This was partly because a quite large proportion of the differentially expressed genes were not annotated. Further biological validation is therefore needed for these networks, using for example in vitro invalidation of genes.

    Probabilistic modeling and machine learning in structural and systems biology

    This supplement contains extended versions of a selected subset of papers presented at the workshop PMSB 2006 (Probabilistic Modeling and Machine Learning in Structural and Systems Biology), held in Tuusula, Finland, from June 17 to 18, 2006.

    Identifying Modules of Coexpressed Transcript Units and Their Organization of Saccharopolyspora erythraea from Time Series Gene Expression Profiles

    BACKGROUND: The Saccharopolyspora erythraea genome sequence was released in 2007. In order to look at gene regulation at the whole-transcriptome level, an expression microarray was specifically designed on the S. erythraea strain NRRL 2338 genome sequence. Based on these data, we set out to investigate the potential transcriptional regulatory networks and their organization. METHODOLOGY/PRINCIPAL FINDINGS: In view of the hierarchical structure of bacterial transcriptional regulation, we constructed a hierarchical coexpression network at the whole-transcriptome level. A total of 27 modules were identified from 1255 differentially expressed transcript units (TUs) across the time course, and these were further classified into four groups. Functional enrichment analysis indicated the biological significance of our hierarchical network. It indicated that primary metabolism is activated in the first rapid growth phase (phase A), and secondary metabolism is induced when growth slows down (phase B). Among the 27 modules, two are highly correlated with erythromycin production. One contains all genes in the erythromycin-biosynthetic (ery) gene cluster, and the other appears to be associated with erythromycin production by sharing common intermediate metabolites. Non-concomitant correlation between production and expression regulation was observed. In particular, by calculating partial correlation coefficients and building a network based on a Gaussian graphical model, intrinsic associations between modules were found, and the association between the two erythromycin-production-correlated modules was included, as expected. CONCLUSIONS: This work created a hierarchical model clustering transcriptome data into coordinated modules, and modules into groups across the time course, giving insight into the concerted transcriptional regulation, especially the regulation corresponding to erythromycin production in S. erythraea. This strategy may be extendable to studies of other prokaryotic microorganisms.

    NetDiff – Bayesian model selection for differential gene regulatory network inference

    Differential networks allow us to better understand the changes in cellular processes that are exhibited in conditions of interest, identifying variations in gene regulation or protein interaction between, for example, cases and controls, or in response to external stimuli. Here we present a novel methodology for the inference of differential gene regulatory networks from gene expression microarray data. Specifically, we apply a Bayesian model selection approach to compare models of conserved and varying network structure, and use Gaussian graphical models to represent the network structures. We apply a variational inference approach to the learning of Gaussian graphical models of gene regulatory networks, which enables Bayesian model selection that is significantly more computationally efficient than Markov chain Monte Carlo approaches. When applied to synthetic network data, our method is more robust than independent analysis of data from multiple conditions, generating fewer false positive predictions of differential edges. We demonstrate the utility of our approach on real-world gene expression microarray data by applying it to existing data from amyotrophic lateral sclerosis cases with and without mutations in C9orf72, and controls, where we are able to identify differential network interactions for further investigation.

    A close examination of double filtering with fold change and t test in microarray analysis

    Background: Many researchers use the double filtering procedure with fold change and t test to identify differentially expressed genes, in the hope that the double filtering will provide extra confidence in the results. Due to its simplicity, the double filtering procedure has been popular with applied researchers despite the development of more sophisticated methods. Results: This paper, for the first time to our knowledge, provides theoretical insight into the drawback of the double filtering procedure. We show that fold change assumes all genes have a common variance, while the t statistic assumes gene-specific variances. The two statistics are based on contradictory assumptions. Under the assumption that gene variances arise from a mixture of a common variance and gene-specific variances, we develop the theoretically most powerful likelihood ratio test statistic. We further demonstrate that posterior inference based on a Bayesian mixture model and the widely used significance analysis of microarrays (SAM) statistic are better approximations to the likelihood ratio test than the double filtering procedure. Conclusion: We demonstrate through hypothesis testing theory, simulation studies and real data examples that well-constructed shrinkage testing methods, which can be united under the mixture gene variance assumption, can considerably outperform the double filtering procedure.
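
The double filtering procedure the abstract critiques is easy to state concretely. The sketch below is a generic illustration, not code from the paper; the cutoffs, sample sizes, and synthetic genes are invented for the demonstration. It keeps only genes passing both a log-fold-change cutoff and a two-sample t-statistic cutoff.

```python
import numpy as np

def double_filter(g1, g2, fc_cut=1.0, t_cut=2.0):
    """Keep genes passing BOTH filters: |log fold change| > fc_cut
    and |two-sample t statistic| > t_cut.

    g1, g2: (genes x replicates) arrays of log-scale expression values.
    The fold-change filter implicitly assumes a common variance across
    genes, while the t filter uses gene-specific variances -- the
    contradictory pair of assumptions the paper analyses.
    """
    n1, n2 = g1.shape[1], g2.shape[1]
    fc = g1.mean(axis=1) - g2.mean(axis=1)          # log fold change
    s2 = ((n1 - 1) * g1.var(axis=1, ddof=1) +
          (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    t = fc / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    return (np.abs(fc) > fc_cut) & (np.abs(t) > t_cut)

# Two synthetic genes: one truly shifted between groups, one null
rng = np.random.default_rng(1)
g1 = np.vstack([2.0 + 0.1 * rng.standard_normal(5),   # shifted gene
                0.1 * rng.standard_normal(5)])        # null gene
g2 = 0.1 * rng.standard_normal((2, 5))
flags = double_filter(g1, g2)
```

A shrinkage statistic of the kind the paper favours replaces the denominator above with something like `s0 + se` (SAM's fudge constant), interpolating between the common-variance and gene-specific-variance extremes rather than filtering on both.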

    Constraint-based probabilistic learning of metabolic pathways from tomato volatiles

    Clustering and correlation analysis techniques have become popular tools for the analysis of data produced by metabolomics experiments. The results obtained from these approaches provide an overview of the interactions between objects of interest. Often in these experiments, one is more interested in information about the nature of these relationships, e.g., cause-effect relationships, than in the actual strength of the interactions. Finding such relationships is of crucial importance, as most biological processes can only be understood in this way. Bayesian networks represent these cause-effect relationships among variables of interest in terms of whether and how they influence each other given that a third, possibly empty, group of variables is known. The technique also allows the incorporation of prior knowledge as established from the literature or from biologists. The representation of these relationships as a directed graph is highly intuitive and helps in understanding the underlying processes. This paper describes how constraint-based Bayesian networks can be applied to metabolomics data and used to uncover the important pathways that play a significant role in the ripening of fresh tomatoes. We also show that this method of reconstructing pathways is intuitive and performs better than classical techniques. Methods for learning Bayesian network models are powerful tools for the analysis of data of the magnitude generated by metabolomics experiments. They allow one to model cause-effect relationships and help in understanding the underlying processes.
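
Constraint-based structure learning of the kind described above rests on conditional-independence tests. The following sketch is a simplification, not the paper's procedure: it runs only the order-0/order-1 step of a PC-style search, with an arbitrary threshold and simulated data, and removes an edge whenever some single conditioning variable makes the partial correlation of its endpoints negligible.

```python
import numpy as np
from itertools import combinations

def partial_corr(C, i, j, k):
    """Correlation of variables i and j after controlling for k,
    computed from the plain correlation matrix C."""
    return ((C[i, j] - C[i, k] * C[j, k]) /
            np.sqrt((1 - C[i, k] ** 2) * (1 - C[j, k] ** 2)))

def skeleton(X, thresh=0.1):
    """Start from the complete graph and drop edge (i, j) if i and j
    look independent marginally or given any single other variable."""
    p = X.shape[1]
    C = np.corrcoef(X, rowvar=False)
    edges = set(combinations(range(p), 2))
    for i, j in list(edges):
        others = [k for k in range(p) if k not in (i, j)]
        if abs(C[i, j]) < thresh or any(
                abs(partial_corr(C, i, j, k)) < thresh for k in others):
            edges.discard((i, j))
    return edges

# Chain x -> y -> z: x and z are correlated, but independent given y,
# so the direct x-z edge should be removed.
rng = np.random.default_rng(2)
x = rng.standard_normal(2000)
y = x + 0.5 * rng.standard_normal(2000)
z = y + 0.5 * rng.standard_normal(2000)
E = skeleton(np.column_stack([x, y, z]))
```

A full constraint-based learner would continue with larger conditioning sets and then orient the surviving edges; incorporating prior knowledge, as the abstract mentions, amounts to fixing some edges as present or absent before the search.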

    Randomization in Laboratory Procedure Is Key to Obtaining Reproducible Microarray Results

    The quality of gene expression microarray data has improved dramatically since the first arrays were introduced in the late 1990s. However, the reproducibility of data generated at multiple laboratory sites remains a matter of concern, especially for scientists who are attempting to combine and analyze data from public repositories. We have carried out a study in which a common set of RNA samples was assayed five times in four different laboratories using Affymetrix GeneChip arrays. We observed dramatic differences in the results across laboratories and identified batch effects in array processing as one of the primary causes for these differences. When batch processing of samples is confounded with experimental factors of interest, it is not possible to separate their effects, and lists of differentially expressed genes may include many artifacts. This study demonstrates the substantial impact of sample processing on microarray analysis results and underscores the need for randomization in the laboratory as a means to avoid confounding of biological factors with procedural effects.
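
The recommendation above can be made concrete in a few lines. This sketch is an invented illustration (the group labels, batch count, and seed are not from the study): it shuffles the sample list before splitting it into processing batches, so that a biological group is not processed as a single block.

```python
import random

def randomize_batches(sample_ids, n_batches, seed=42):
    """Assign samples to processing batches in random order, so that
    biological groups end up interleaved across batches instead of
    being confounded with a single batch."""
    rng = random.Random(seed)   # fixed seed keeps the design reproducible
    order = list(sample_ids)
    rng.shuffle(order)
    return {b: order[b::n_batches] for b in range(n_batches)}

# Eight hypothetical samples, four cases and four controls, two batches
samples = ["case1", "case2", "case3", "case4",
           "ctrl1", "ctrl2", "ctrl3", "ctrl4"]
batches = randomize_batches(samples, n_batches=2)
```

Without the shuffle (for example, processing all cases in one batch and all controls in another), any batch effect would be statistically inseparable from the case/control contrast, which is exactly the confounding the study warns against.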
